1. Problem definition

AirBnB is a marketplace for short term rentals that allows you to list part or all of your living space for others to rent. You can rent everything from a room in an apartment to your entire house on AirBnB. Because most of the listings are on a short-term basis, AirBnB has grown to become a popular alternative to hotels. The company itself has grown from it's founding in 2008 to a 30 billion dollar valuation in 2016 and is currently worth more than any hotel chain in the world.

One challenge that hosts looking to rent their living space face is determining the optimal nightly rent price. In many areas, renters are presented with a good selection of listings and can filter on criteria like price, number of bedrooms, room type and more. Since AirBnB is a marketplace, the amount a host can charge on a nightly basis is closely linked to the dynamics of the marketplace. Here's a screenshot of the search experience on AirBnB:

As a host, if we try to charge above market price for a living space we'd like to rent, then renters will select more affordable alternatives which are similar to ours.. If we set our nightly rent price too low, we'll miss out on potential revenue.

One strategy we could use is to:

find a few listings that are similar to ours,
average the listed price for the ones most similar to ours,
set our listing price to this calculated average price.

The process of discovering patterns in existing data to make a prediction is called machine learning. In our case, we want to use data on local listings to predict the optimal price for us to set. In this mission, we'll explore a specific machine learning technique called k-nearest neighbors, which mirrors the strategy we just described. Before we dive further into machine learning and k-nearest neighbors, let's get familiar with the dataset we'll be working with.

2. Introduction to the data

While AirBnB doesn't release any data on the listings in their marketplace, a separate group named Inside AirBnB has extracted data on a sample of the listings for many of the major cities on the website. In this post, we'll be working with their dataset from October 3, 2015 on the listings from Washington, D.C., the capital of the United States. Here's a direct link to that dataset. Each row in the dataset is a specific listing that's available for renting on AirBnB in the Washington, D.C. area

To make the dataset less cumbersome to work with, we've removed many of the columns in the original dataset and renamed the file to dc_airbnb.csv. Here are the columns we kept:

host_response_rate: the response rate of the host
host_acceptance_rate: number of requests to the host that convert to rentals
host_listings_count: number of other listings the host has
latitude: latitude dimension of the geographic coordinates
longitude: longitude part of the coordinates
city: the city the living space resides
zipcode: the zip code the living space resides
state: the state the living space resides
accommodates: the number of guests the rental can accommodate
room_type: the type of living space (Private room, Shared room or Entire home/apt
bedrooms: number of bedrooms included in the rental
bathrooms: number of bathrooms included in the rental
beds: number of beds included in the rental
price: nightly price for the rental
cleaning_fee: additional fee used for cleaning the living space after the guest leaves
security_deposit: refundable security deposit, in case of damages
minimum_nights: minimum number of nights a guest can stay for the rental
maximum_nightss: maximum number of nights a guest can stay for the rental
number_of_reviews: number of reviews that previous guests have left

Let's read the dataset into Pandas and become more familiar with it.

Exercise Start.

Description:

Read dc_airbnb.csv into a Dataframe named dc_listings.
Use the print function to display the first row in dc_listings.



In [1]:

    
import numpy as np
import pandas as pd
import csv



In [3]:

    
dc_listings = pd.read_csv('dc_airbnb.csv')
print(dc_listings.loc[0])









    



host_response_rate                  92%
host_acceptance_rate                91%
host_listings_count                  26
accommodates                          4
room_type               Entire home/apt
bedrooms                              1
bathrooms                             1
beds                                  2
price                           $160.00
cleaning_fee                    $115.00
security_deposit                $100.00
minimum_nights                        1
maximum_nights                     1125
number_of_reviews                     0
latitude                          38.89
longitude                      -77.0028
city                         Washington
zipcode                           20003
state                                DC
Name: 0, dtype: object

3. K-nearest neighbors

Here's the strategy we wanted to use:

Find a few similar listings.
Calculate the average nightly rental price of these listings.
Set the average price as the price for our listing.

The k-nearest neighbors algorithm is similar to this strategy. Here's an overview:

There are 2 things we need to unpack in more detail:

the similarity metric
how to choose the k value

In this mission, we'll define what similarity metric we're going to use. Then, we'll implement the k-nearest neighbors algorithm and use it to suggest a price for a new, unpriced listing. We'll use a k value of 5 in this mission. In later missions, we'll learn how to evaluate how good the suggested prices are, how to choose the optimal k value, and more.

4. Euclidean distance

The similarity metric works by comparing a fixed set of numerical features, another word for attributes, between 2 observations, or living spaces in our case. When trying to predict a continuous value, like price, the main similarity metric that's used is Euclidean distance. Here's the general formula for Euclidean distance:

$\displaystyle d = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \ldots + (q_n - p_n)^2}$

where $q_1$ to $q_n$ represent the feature values for one observation and $p_1$ to $p_n$ represent the feature values for the other observation. Here's a diagram that breaks down the Euclidean distance between the first 2 observations in the dataset using only the host_listings_count, accommodates, bedrooms, bathrooms, and beds columns:

In this mission, we'll use just one feature in this mission to keep things simple as you become familiar with the machine learning workflow. Since we're only using one feature, this is known as the univariate case. Here's how the formula looks like for the univariate case:

$\displaystyle d = \sqrt{(q_1 - p_1)^2}$

The square root and the squared power cancel and the formula simplifies to:

$ \displaystyle d = \left | q_1 - p_1 \right |$

The living space that we want to rent can accommodate 3 people. Let's first calculate the distance, using just the accommodates feature, between the first living space in the dataset and our own.

Exercise Start.

Description:

Calculate the Euclidean distance between our living space, which can accommodate 3 people, and the first living space in the dc_listings Dataframe.
Assign the result to first_distance and display the value using the print function.



In [5]:

    
max_people_accommodate = 3
first_distance = np.abs(dc_listings['accommodates'][0] - max_people_accommodate)
print(first_distance)

5. Calculate distance for all observations

The Euclidean distance between the first row in the dc_listings Dataframe and our own living space is 1. How do we know if this is high or low? If you look at the Euclidean distance equation itself, the lowest value you can achieve is 0. This happens when the value for the feature is exactly the same for both observations you're comparing. If p1=q1, then $ \displaystyle d = \left | q_1 - p_1 \right |$ which results in $d=0$. The closer to 0 the distance the more similar the living spaces are.

If we wanted to calculate the Euclidean distance between each living space in the dataset and a living space that accommodates 8 people, here's a preview of what that would look like.

Then, we can rank the existing living spaces by ascending distance values, the proxy for similarity.

Exercise Start.

Description:

Calculate the distance between each value in the accommodates column from dc_listings and the value 3, which is the number of people our listing accommodates:
- Use the apply method to calculate the absolute value between each value in accommodates and 3 and return a new Series containing the distance values.
Assign the distance values to the distance column.
Use the Series method value_counts and the print function to display the unique value counts for the distance column.



In [32]:

    
def calc_distance(row,acc):
    accomodates = acc
    distance = sqrt((row['accommodates'] - accomodates)**2)
    
    return distance



In [33]:

    
distance = dc_listings.apply(lambda row: calc_distance(row,3), axis=1)



In [34]:

    
dc_listings['distance'] = distance
print(dc_listings['distance'].value_counts())









    



1.0     2294
2.0      503
0.0      461
3.0      279
5.0       73
4.0       35
7.0       22
6.0       17
9.0       12
13.0       8
8.0        7
12.0       6
11.0       4
10.0       2
Name: distance, dtype: int64

6. Randomizing, and sorting

It looks like there are quite a few, 461 to be precise, living spaces that can accommodate 3 people just like ours. This means the 5 "nearest neighbors" we select after sorting all will have a distance value of 0. If we sort by the distance column and then just select the first 5 living spaces, we would be biasing the result to the ordering of the dataset.

dc_listings[dc_listings["distance"] == 0]["accommodates"]
26      3
34      3
36      3
40      3
44      3
45      3
48      3
65      3
66      3
71      3
75      3
86      3
...

Let's instead randomize the ordering of the dataset and then sort the Dataframe by the distance column. This way, all of the living spaces with the same number of bedrooms will still be at the top of the Dataframe but will be in random order across the first 461 rows. We've already done the first step of setting the random seed, so we can perform answer checking on our end.

Exercise Start.

Description:

Randomize the order of the rows in dc_listings:
- Use the np.random.permutation() function to return a NumPy array of shuffled index values.
- Use the Dataframe method loc[] to return a new Dataframe containing the shuffled order.
- Assign the new Dataframe back to dc_listings.
After randomization, sort dc_listings by the distance column.
Display the first 10 values in the price column using the print function.



In [9]:

    
np.random.seed(1)

shuffled_indexes = np.random.permutation(len(dc_listings))
shuffled_dc_listings = dc_listings.loc[shuffled_indexes]
dc_listings = shuffled_dc_listings



In [11]:

    
dc_listings.head()









    Out[11]:






  
    
      
      host_response_rate
      host_acceptance_rate
      host_listings_count
      accommodates
      room_type
      bedrooms
      bathrooms
      beds
      price
      cleaning_fee
      security_deposit
      minimum_nights
      maximum_nights
      number_of_reviews
      latitude
      longitude
      city
      zipcode
      state
      distance
    
  
  
    
      574
      100%
      100%
      1
      2
      Private room
      1.0
      1.0
      1.0
      $125.00
      NaN
      $300.00
      1
      4
      149
      38.913548
      -77.031981
      Washington
      20009
      DC
      1.0
    
    
      1593
      87%
      100%
      2
      2
      Private room
      1.0
      1.5
      1.0
      $85.00
      $15.00
      NaN
      1
      30
      49
      38.953431
      -77.030695
      Washington
      20011
      DC
      1.0
    
    
      3091
      100%
      NaN
      1
      1
      Private room
      1.0
      0.5
      1.0
      $50.00
      NaN
      NaN
      1
      1125
      1
      38.933491
      -77.029679
      Washington
      20010
      DC
      2.0
    
    
      420
      58%
      51%
      480
      2
      Entire home/apt
      1.0
      1.0
      1.0
      $209.00
      $150.00
      NaN
      4
      730
      2
      38.904054
      -77.051991
      Washington
      20037
      DC
      1.0
    
    
      808
      100%
      95%
      3
      12
      Entire home/apt
      5.0
      2.0
      5.0
      $215.00
      $135.00
      $100.00
      2
      1825
      34
      38.906118
      -76.988873
      Washington
      20002
      DC
      9.0



In [12]:

    
dc_listings.sort_values(by='distance', inplace=True)



In [13]:

    
print(dc_listings['price'][:10])









    



577     $185.00
2166    $180.00
3631    $175.00
71      $128.00
1011    $115.00
380     $219.00
943     $125.00
3107    $250.00
1499     $94.00
625     $150.00
Name: price, dtype: object

7. Average price

Before we can select the 5 most similar living spaces and compute the average price, we need to clean the price column. Right now, the price column contains comma characters (,) and dollar sign characters and is formatted as a text column instead of a numeric one. We need to remove these values and convert the entire column to the float datatype. Then, we can calculate the average price.

Exercise Start.

Description:

Remove the commas (,) and dollar sign characters ($) from the price column:
- Use the str accessor so we can apply string methods to each value in the column followed by the string method replace to replace all comma characters with the empty character: stripped_commas = dc_listings['price'].str.replace(',', '')
- Repeat to remove the dollar sign characters as well.
Convert the new Series object containing the cleaned values to the float datatype and assign back to the price column in dc_listings.
Calculate the mean of the first 5 values in the price column and assign to mean_price.
Use the print function or the variable inspector below to display mean_price



In [14]:

    
#Adjusting data
stripped_commas = dc_listings['price'].str.replace(',', '')
stripped_commas = stripped_commas.str.replace('$','')



In [15]:

    
#Series cast
dc_listings['price'] = stripped_commas.astype(float)



In [16]:

    
mean_price = dc_listings['price'][:5].mean()
print(mean_price)

8. Function to make predictions

Congrats! You've just made your first prediction! Based on the average price of other listings that accommdate 3 people, we should charge 156.6 dollars per night for a guest to stay at our living space. In the next mission, we'll dive into evaluating how good of a prediction this is.

Let's write a more general function that can suggest the optimal price for other values of the accommodates column. The dc_listings Dataframe has information specific to our living space, e.g. the distance column. To save you time, we've reset the dc_listings Dataframe to a clean slate and only kept the data cleaning and randomization we did since those weren't unique to the prediction we were making for our living space.

Exercise Start.

Description:

Write a function named predict_price that can use the k-nearest neighbors machine learning technique to calculate the suggested price for any value for accommodates. This function should:
- Take in a single parameter, new_listing, that describes the number of bedrooms.
- Assign dc_listings to a new Dataframe named temp_df so we aren't constantly modifying the original dataset each time we call the function.
- Calculate the distance between each value in the accommodates column and the new_listing value that was passed in. Assign the resulting Series object to the distance column in temp_df.
- Sort temp_df by the distance column and select the first 5 values in the price column. Don't randomize the ordering of temp_df.
- Calculate the mean of these 5 values and use that as the return value for the entire predict_price function.
Use the predict_price function to suggest a price for a living space that:
- accommodates 1 person, assign the suggested price to acc_one.
- accommodates 2 people, assign the suggested price to acc_two.
- accommodates 4 people, assign the suggested price to acc_four.



In [21]:

    
# Brought along the changes we made to the `dc_listings` Dataframe.
dc_listings = pd.read_csv('dc_airbnb.csv')
stripped_commas = dc_listings['price'].str.replace(',', '')
stripped_dollars = stripped_commas.str.replace('$', '')
dc_listings['price'] = stripped_dollars.astype('float')
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]



In [35]:

    
def predict_price(new_listing):
    auxdf = pd.DataFrame(data=dc_listings)
    
    distance = auxdf.apply(lambda row: calc_distance(row, new_listing), axis=1)
    auxdf['distance'] = distance
    auxdf.sort_values(by='distance', inplace=True)

    return auxdf['price'][:5].mean()



In [36]:

    
acc_one  = predict_price(1)
acc_two  = predict_price(3)
acc_four = predict_price(4)

print("Accommodates 1 person: " + str(acc_one))
print("Accommodates 2 person: " + str(acc_two))
print("Accommodates 4 person: " + str(acc_four))









    



Accommodates 1 person: 90.6
Accommodates 2 person: 128.8
Accommodates 4 person: 180.8

9. Next steps

In this mission, we explored the problem of predicting the optimal price to list an AirBnB rental for based on the price of similar listings on the site. We stepped through the entire machine learning workflow, from selecting a feature to testing the model. To explore the basics of machine learning, we limited ourselves to only using one feature (the univariate case) and a fixed k value of 5.

In the next mission, we'll learn how to evaluate a model's performance.

	host_response_rate	host_acceptance_rate	host_listings_count	accommodates	room_type	bedrooms	bathrooms	beds	price	cleaning_fee	security_deposit	minimum_nights	maximum_nights	number_of_reviews	latitude	longitude	city	zipcode	state	distance
574	100%	100%	1	2	Private room	1.0	1.0	1.0	$125.00	NaN	$300.00	1	4	149	38.913548	-77.031981	Washington	20009	DC	1.0
1593	87%	100%	2	2	Private room	1.0	1.5	1.0	$85.00	$15.00	NaN	1	30	49	38.953431	-77.030695	Washington	20011	DC	1.0
3091	100%	NaN	1	1	Private room	1.0	0.5	1.0	$50.00	NaN	NaN	1	1125	1	38.933491	-77.029679	Washington	20010	DC	2.0
420	58%	51%	480	2	Entire home/apt	1.0	1.0	1.0	$209.00	$150.00	NaN	4	730	2	38.904054	-77.051991	Washington	20037	DC	1.0
808	100%	95%	3	12	Entire home/apt	5.0	2.0	5.0	$215.00	$135.00	$100.00	2	1825	34	38.906118	-76.988873	Washington	20002	DC	9.0